Diffusion probabilistic models - Score matching¶
Author : Philippe Esling (esling@ircam.fr)¶
This notebook explores the new class of generative models based on diffusion probabilistic models [1]. This class of models is inspired by considerations from thermodynamics [2], but also bears strong resemblance to denoising score matching [3], Langevin dynamics and autoregressive decoding. We will also discuss the more recent development of denoising diffusion implicit models [4], which bypass the need for a Markov chain to accelerate sampling. Stemming from this work, we will also discuss the WaveGrad model [5], which is based on the same core principles but applies this class of models to audio data.
In order to fully understand the inner workings of diffusion models, we will review all of the related topics. Therefore, we split the explanation across four detailed notebooks.
- Score matching and Langevin dynamics.
- Diffusion probabilistic models and denoising
- Applications to waveforms with WaveGrad
- Implicit models to accelerate inference
Score matching¶
In this section, we start by reviewing the foundations of score matching needed to fully understand diffusion models. To do so, we will work with the traditional Swiss roll dataset.
%matplotlib inline
import sys
import os
sys.path.append(os.path.dirname(os.getcwd()))
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_swiss_roll
from helper_plot import hdr_plot_style
hdr_plot_style()
# Sample a batch from the swiss roll
def sample_batch(size, noise=1.0):
    x, _ = make_swiss_roll(size, noise=noise)
    return x[:, [0, 2]] / 10.0
# Plot it
data = sample_batch(10**4).T
plt.figure(figsize=(16, 12))
plt.scatter(*data, alpha=0.5, color='red', edgecolor='white', s=40)
The idea of score matching was originally proposed by Hyvärinen [6]. Instead of directly learning the probability of the data $\log p(\mathbf{x})$, we aim to learn the gradients of $\log p(\mathbf{x})$ with respect to $\mathbf{x}$. In this case, the gradients $\nabla_{\mathbf{x}} \log p(\mathbf{x})$ are termed the score of the density $p(\mathbf{x})$, hence the name score matching. This can be understood as learning the direction of highest probability at each point in the input space. Therefore, once the model is trained, we can improve a sample $\mathbf{x}$ by moving it along these directions of highest probability.
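To build intuition, the score has a closed form for simple densities: for an isotropic Gaussian $\mathcal{N}(\mathbf{\mu}, \sigma^2\mathbf{I})$, we have $\nabla_{\mathbf{x}} \log p(\mathbf{x}) = (\mathbf{\mu} - \mathbf{x})/\sigma^2$, so the score always points back toward the mean. As a standalone sketch (with hypothetical helper names, not part of the original notebook), we can verify this formula against a finite-difference gradient of the log-density:

```python
import numpy as np

def gaussian_log_density(x, mu, sigma):
    # log N(x; mu, sigma^2 I), including the normalization constant
    return -np.sum((x - mu) ** 2) / (2 * sigma ** 2) - x.size * np.log(sigma * np.sqrt(2 * np.pi))

def gaussian_score(x, mu, sigma):
    # Closed-form score: gradient of the log-density w.r.t. x
    return (mu - x) / sigma ** 2

mu, sigma = np.array([1.0, -0.5]), 0.8
x = np.array([0.3, 0.7])
# Central finite-difference approximation of the gradient of log p
eps = 1e-5
fd = np.array([(gaussian_log_density(x + eps * e, mu, sigma)
                - gaussian_log_density(x - eps * e, mu, sigma)) / (2 * eps)
               for e in np.eye(2)])
print(gaussian_score(x, mu, sigma), fd)  # the two gradients agree
```

Note that the normalization constant vanishes when taking the gradient, which is precisely why score matching can sidestep intractable partition functions.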
However, to perform training, we would need to minimize the error of our model $\mathcal{F}_{\theta}(\mathbf{x})$ at predicting the gradient $\nabla_{\mathbf{x}} \log p(\mathbf{x})$, which amounts to minimizing the Fisher divergence, or simply the MSE
$$ \mathcal{L}_{mse} = E_{\mathbf{x} \sim p(\mathbf{x})} \left[ \left\lVert \mathcal{F}_{\theta}(\mathbf{x}) - \nabla_{\mathbf{x}} \log p(\mathbf{x}) \right\rVert_2^2 \right] $$
The real $\nabla_{\mathbf{x}} \log p(\mathbf{x})$ is usually unknown, but under regularity assumptions on $p(\mathbf{x})$, it can be shown that the minimum of $\mathcal{L}_{mse}$ can be found through a tractable objective
$$ \mathcal{L}_{matching} = E_{\mathbf{x} \sim p(\mathbf{x})} \left[ \text{tr}\left( \nabla_{\mathbf{x}} \mathcal{F}_{\theta}(\mathbf{x}) \right) + \frac{1}{2} \left\lVert \mathcal{F}_{\theta}(\mathbf{x}) \right\rVert_2^2 \right] , $$
where $\nabla_{\mathbf{x}} \mathcal{F}_{\theta}(\mathbf{x})$ denotes the Jacobian of $\mathcal{F}_{\theta}(\mathbf{x})$ with respect to $\mathbf{x}$, and $ \text{tr}(\cdot) $ is the trace operation.
To perform this optimization, we will rely on PyTorch and define $\mathcal{F}_{\theta}(\mathbf{x})$ as a neural network
import torch
import torch.nn as nn
import torch.optim as optim
# Our approximation model
model = nn.Sequential(
    nn.Linear(2, 128), nn.Softplus(),
    nn.Linear(128, 128), nn.Softplus(),
    nn.Linear(128, 2)
)
# Create ADAM optimizer over our model
optimizer = optim.Adam(model.parameters(), lr=1e-3)
Next, we need to define the loss function for the score matching objective. First, to compute the Jacobian, we need a dedicated (and differentiable) function. (This efficient implementation is based on a discussion that can be found here)
import torch.autograd as autograd
def jacobian(f, x):
    """Computes the Jacobian of f w.r.t x.

    :param f: function R^N -> R^N
    :param x: torch.tensor of shape [B, N]
    :return: Jacobian matrix (torch.tensor) of shape [B, N, N]
    """
    B, N = x.shape
    y = f(x)
    jacobian = list()
    for i in range(N):
        v = torch.zeros_like(y)
        v[:, i] = 1.
        # Gradient of y_i w.r.t x, shape [B, N]
        dy_i_dx = autograd.grad(y, x, grad_outputs=v, retain_graph=True, create_graph=True, allow_unused=True)[0]
        jacobian.append(dy_i_dx)
    jacobian = torch.stack(jacobian, dim=2).requires_grad_()
    return jacobian
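As a quick sanity check of Jacobian-based traces (a standalone sketch using `torch.autograd.functional.jacobian` rather than the batched helper above), consider a linear map $f(\mathbf{x}) = W\mathbf{x}$: its Jacobian is exactly $W$, so the trace term of the objective reduces to $\text{tr}(W)$:

```python
import torch

# For a linear map f(x) = W @ x, the Jacobian is exactly W,
# so the trace term of the score matching objective is tr(W)
W = torch.tensor([[2.0, 1.0], [0.0, 3.0]])
f = lambda x: W @ x
x = torch.randn(2)
J = torch.autograd.functional.jacobian(f, x)  # shape [2, 2]
trace = torch.diagonal(J).sum()
print(trace.item())  # tr(W) = 5
```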
The actual score matching loss function combines the norm of the model output $\frac{1}{2} \left\lVert \mathcal{F}_{\theta}(\mathbf{x}) \right\rVert_2^2$, which we compute first, with the trace of the Jacobian $\text{tr}\left( \nabla_{\mathbf{x}} \mathcal{F}_{\theta}(\mathbf{x}) \right)$; we return their sum as the full loss.
def score_matching(model, samples):
    samples.requires_grad_(True)
    logp = model(samples)
    # Compute the norm loss
    norm_loss = torch.norm(logp, dim=-1) ** 2 / 2.
    # Compute the trace of the Jacobian
    jacob_mat = jacobian(model, samples)
    tr_jacobian_loss = torch.diagonal(jacob_mat, dim1=-2, dim2=-1).sum(-1)
    return (tr_jacobian_loss + norm_loss).mean(-1)
Finally, we can run our code to train the model
dataset = torch.tensor(data.T).float()
for t in range(2000):
    # Compute the loss
    loss = score_matching(model, dataset)
    # Before the backward pass, zero all of the network gradients
    optimizer.zero_grad()
    # Backward pass: compute gradient of the loss with respect to parameters
    loss.backward()
    # Calling the step function to update the parameters
    optimizer.step()
    # Print loss
    if ((t % 500) == 0):
        print(loss)
tensor(0.4140, grad_fn=<MeanBackward1>)
tensor(-13.6779, grad_fn=<MeanBackward1>)
tensor(-38.8591, grad_fn=<MeanBackward1>)
tensor(-48.4584, grad_fn=<MeanBackward1>)
We can observe that our model has learned to represent $\mathcal{F}_{\theta}(\mathbf{x}) \approx \nabla_x \log p(x)$ by plotting the output value across the input space
def plot_gradients(model, data, plot_scatter=True):
    xx = np.stack(np.meshgrid(np.linspace(-1.5, 2.0, 50), np.linspace(-1.5, 2.0, 50)), axis=-1).reshape(-1, 2)
    scores = model(torch.tensor(xx).float()).detach()
    scores_norm = np.linalg.norm(scores, axis=-1, ord=2, keepdims=True)
    scores_log1p = scores / (scores_norm + 1e-9) * np.log1p(scores_norm)
    # Perform the plots
    plt.figure(figsize=(16, 12))
    if (plot_scatter):
        plt.scatter(*data, alpha=0.3, color='red', edgecolor='white', s=40)
    plt.quiver(*xx.T, *scores_log1p.T, width=0.002, color='white')
    plt.xlim(-1.5, 2.0)
    plt.ylim(-1.5, 2.0)
plot_gradients(model, data)
Langevin dynamics¶
After training, our model is able to produce an approximation of the gradient of the probability, such that $\mathcal{F}_{\theta}(\mathbf{x}) \approx \nabla_{\mathbf{x}} \log p(\mathbf{x})$. Therefore, we could use this to generate data through a simple gradient ascent: starting from an initial sample $\mathbf{x}_{0} \sim \mathcal{N}(\mathbf{0},\mathbf{I})$, we use the gradient information to find a local maximum of $p(\mathbf{x})$
$$\mathbf{x}_{t + 1} = \mathbf{x}_t + \epsilon \nabla_{\mathbf{x}_t} \log p(\mathbf{x}_t)$$
where $\epsilon$ defines the size of the step we take in the direction of the gradient (akin to the learning rate).
def sample_simple(model, x, n_steps=20, eps=1e-3):
    x_sequence = [x.unsqueeze(0)]
    for s in range(n_steps):
        x = x + eps * model(x)
        x_sequence.append(x.unsqueeze(0))
    return torch.cat(x_sequence)
x = torch.Tensor([1.5, -1.5])
samples = sample_simple(model, x).detach()
plot_gradients(model, data)
plt.scatter(samples[:, 0], samples[:, 1], color='green', edgecolor='white', s=150)
# draw arrows for each step
deltas = (samples[1:] - samples[:-1])
deltas = deltas - deltas / np.linalg.norm(deltas, keepdims=True, axis=-1) * 0.04
for i, arrow in enumerate(deltas):
    plt.arrow(samples[i, 0], samples[i, 1], arrow[0], arrow[1], width=1e-4, head_width=2e-2, color="green", linewidth=3)
However, the previous procedure does not produce a true sample from $\mathbf{x} \sim p(\mathbf{x})$; it merely converges to a local maximum of the density. To obtain such a sample, we can rely on Langevin dynamics, which can produce true samples from the density $p(\mathbf{x})$ using only $\nabla_{\mathbf{x}} \log p(\mathbf{x})$. The sampling is defined in a way very similar to MCMC approaches, by applying recursively
$$\mathbf{x}_{t + 1} = \mathbf{x}_t + \frac{\epsilon}{2} \nabla_{\mathbf{x}_t} \log p(\mathbf{x}_t) + \sqrt{\epsilon} \mathbf{z}_{t}$$
where $\mathbf{z}_{t}\sim \mathcal{N}(\mathbf{0},\mathbf{I})$. It has been shown in Welling et al. (2011) that as $\epsilon \rightarrow 0$ and $t \rightarrow \infty$, $\mathbf{x}_t$ converges to an exact sample from $p(\mathbf{x})$. This is a key idea behind the score-based generative modeling approach.
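To see this convergence concretely, here is a small self-contained sketch (not part of the original notebook) that runs Langevin dynamics with the exact score of a 1-D Gaussian $\mathcal{N}(2, 0.5^2)$; the chain statistics approach the true mean and standard deviation:

```python
import torch

torch.manual_seed(0)
mu, std = 2.0, 0.5
# Exact score of N(mu, std^2): gradient of the log-density
score = lambda x: (mu - x) / std ** 2
eps = 1e-2
x = torch.randn(1000)  # 1000 independent chains, started from N(0, 1)
for t in range(2000):
    # Langevin update: half-step along the score plus Gaussian noise
    x = x + (eps / 2) * score(x) + eps ** 0.5 * torch.randn_like(x)
print(x.mean().item(), x.std().item())  # close to (2.0, 0.5)
```

With a fixed $\epsilon$, the chain has a small discretization bias of order $\epsilon$, which is why the theory requires annealing $\epsilon \rightarrow 0$.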
In order to implement this sampling procedure, we can once again start from $\mathbf{x}_{0} \sim \mathcal{N}(\mathbf{0},\mathbf{I})$, and progressively anneal $\epsilon \rightarrow 0$ at each step, to obtain true samples from $p(\mathbf{x})$.
def sample_langevin(model, x, n_steps=10, eps=1e-2, decay=.9, temperature=1.0):
    x_sequence = [x.unsqueeze(0)]
    for s in range(n_steps):
        # Gaussian noise term, z_t ~ N(0, I)
        z_t = torch.randn(x.size())
        x = x + (eps / 2) * model(x) + (np.sqrt(eps) * temperature * z_t)
        x_sequence.append(x.unsqueeze(0))
        # Anneal the step size
        eps *= decay
    return torch.cat(x_sequence)
x = torch.Tensor([1.5, -1.5])
samples = sample_langevin(model, x).detach()
plot_gradients(model, data)
plt.scatter(samples[:, 0], samples[:, 1], color='green', edgecolor='white', s=150)
# draw arrows for each mcmc step
deltas = (samples[1:] - samples[:-1])
deltas = deltas - deltas / np.linalg.norm(deltas, keepdims=True, axis=-1) * 0.04
for i, arrow in enumerate(deltas):
    plt.arrow(samples[i, 0], samples[i, 1], arrow[0], arrow[1], width=1e-4, head_width=2e-2, color="green", linewidth=3)
Sliced score matching¶
The previously defined score matching loss is not scalable to high-dimensional data, nor to deep networks, because of the computation of $\text{tr}\left( \nabla_{\mathbf{x}} \mathcal{F}_{\theta}(\mathbf{x}) \right)$. Indeed, computing the Jacobian requires a number of backward passes proportional to the data dimensionality, making it unsuitable for high-dimensional problems, even with the optimized implementation proposed in the previous code. More recently, Song et al. [7] proposed to use random projections to approximate the computation of $\text{tr}\left( \nabla_{\mathbf{x}} \mathcal{F}_{\theta}(\mathbf{x}) \right)$ in score matching. This approach, called Sliced Score Matching, replaces the optimization objective with
$$ E_{\mathbf{v} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})} E_{\mathbf{x} \sim p(\mathbf{x})} \left[ \mathbf{v}^T \nabla_{\mathbf{x}} \mathcal{F}_{\theta}(\mathbf{x}) \mathbf{v} + \frac{1}{2} \left\lVert \mathbf{v}^T \mathcal{F}_{\theta}(\mathbf{x}) \right\rVert_2^2 \right] , $$
where $\mathbf{v}$ is a random projection vector drawn from a standard Normal distribution. They show that this objective can be computed using forward-mode auto-differentiation, which is computationally efficient. This loss can be implemented as follows (where we compute once again the norm first, and then the Jacobian loss)
def sliced_score_matching(model, samples):
    samples.requires_grad_(True)
    # Construct random vectors
    vectors = torch.randn_like(samples)
    vectors = vectors / torch.norm(vectors, dim=-1, keepdim=True)
    # Compute the optimized Jacobian-vector product
    logp, jvp = autograd.functional.jvp(model, samples, vectors, create_graph=True)
    # Compute the norm loss
    norm_loss = (logp * vectors) ** 2 / 2.
    # Compute the Jacobian loss
    v_jvp = jvp * vectors
    jacob_loss = v_jvp
    loss = jacob_loss + norm_loss
    return loss.mean(-1).mean(-1)
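The correctness of this random-projection trick rests on the Hutchinson-type identity $E_{\mathbf{v}}[\mathbf{v}^T A \mathbf{v}] = \text{tr}(A)$ for any $\mathbf{v}$ with zero mean and identity covariance. A small standalone check with NumPy (an illustrative sketch, not from the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
# Monte-Carlo estimate of tr(A) via E_v[v^T A v] with v ~ N(0, I)
vs = rng.normal(size=(200000, 4))
est = np.mean(np.einsum('bi,ij,bj->b', vs, A, vs))
print(est, np.trace(A))  # the two values agree closely
```

Note that the implementation above additionally normalizes each $\mathbf{v}$ to unit norm; in expectation this rescales both terms of the objective by the same constant, so it does not change the minimizer.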
As previously, we can perform a simple optimization of this loss given a set of examples.
# Our approximation model
model = nn.Sequential(
    nn.Linear(2, 128), nn.Softplus(),
    nn.Linear(128, 128), nn.Softplus(),
    nn.Linear(128, 2)
)
# Create ADAM optimizer over our model
optimizer = optim.Adam(model.parameters(), lr=1e-3)
dataset = torch.tensor(data.T)[:1000].float()
for t in range(2000):
    # Compute the loss
    loss = sliced_score_matching(model, dataset)
    # Before the backward pass, zero all of the network gradients
    optimizer.zero_grad()
    # Backward pass: compute gradient of the loss with respect to parameters
    loss.backward()
    # Calling the step function to update the parameters
    optimizer.step()
    # Print loss
    if ((t % 500) == 0):
        print(loss)
tensor(0.0436, grad_fn=<MeanBackward1>)
tensor(-1.1006, grad_fn=<MeanBackward1>)
tensor(-6.4248, grad_fn=<MeanBackward1>)
tensor(-9.4135, grad_fn=<MeanBackward1>)
We can check our approximation by relying on the same function as before.
plot_gradients(model, data)
Denoising score matching¶
Originally, the notion of denoising score matching was discussed by Vincent [3] in the context of denoising auto-encoders. This formulation completely removes the need for $\nabla_{\mathbf{x}} \mathcal{F}_{\theta}(\mathbf{x})$ in the computation of score matching. To do so, we first corrupt the input point $\mathbf{x}$ with a given noise distribution, leading to $q_{\sigma}(\tilde{\mathbf{x}}\mid\mathbf{x})$. Then, score matching can be used to estimate the score of this perturbed data distribution. It has been shown in [3] that the optimal network approximating $\mathcal{F}_{\theta}(\mathbf{x}) \approx \nabla_{\mathbf{x}} \log p(\mathbf{x})$ can be found by minimizing the following objective
$$ E_{q_{\sigma}(\tilde{\mathbf{x}}\mid\mathbf{x})} E_{\mathbf{x} \sim p(\mathbf{x})} \left[ \left\lVert \mathcal{F}_{\theta}(\tilde{\mathbf{x}}) - \nabla_{\tilde{\mathbf{x}}} \log q_{\sigma}(\tilde{\mathbf{x}}\mid\mathbf{x}) \right\rVert_2^2 \right] , $$
However, it should be noted that $\mathcal{F}_{\theta}(\mathbf{x}) = \nabla_{\mathbf{x}} \log q_{\sigma}(\mathbf{x}) \approx \nabla_{\mathbf{x}} \log p(\mathbf{x})$ only holds when the noise is small enough to consider that $q_{\sigma}(\mathbf{x}) \approx p(\mathbf{x})$. As shown in [3], [8], if we choose the noise distribution to be $q_{\sigma}(\tilde{\mathbf{x}}\mid\mathbf{x})=\mathcal{N}(\tilde{\mathbf{x}}\mid\mathbf{x}, \sigma^{2}\mathbf{I})$, then we have $\nabla_{\tilde{\mathbf{x}}} \log q_{\sigma}(\tilde{\mathbf{x}}\mid\mathbf{x}) = -\frac{\tilde{\mathbf{x}} - \mathbf{x}}{\sigma^{2}}$. Therefore, the denoising score matching loss simply becomes
$$ \ell(\theta;\sigma) = E_{q_{\sigma}(\tilde{\mathbf{x}}\mid\mathbf{x})} E_{\mathbf{x} \sim p(\mathbf{x})} \left[ \left\lVert \mathcal{F}_{\theta}(\tilde{\mathbf{x}}) + \frac{\tilde{\mathbf{x}} - \mathbf{x}}{\sigma^{2}} \right\rVert_2^2 \right] $$
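Since $\tilde{\mathbf{x}} = \mathbf{x} + \sigma \mathbf{z}$ with $\mathbf{z} \sim \mathcal{N}(\mathbf{0},\mathbf{I})$, the regression target $-\frac{\tilde{\mathbf{x}} - \mathbf{x}}{\sigma^{2}}$ is simply $-\mathbf{z}/\sigma$: denoising score matching trains the network to predict the (scaled) noise that was added. A quick standalone numeric check of this sign convention (finite differences on the log of the perturbation kernel):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=2)        # clean point
sigma = 0.1
z = rng.normal(size=2)        # noise draw
x_tilde = x + sigma * z       # perturbed point

def log_q(xt):
    # log q_sigma(xt | x) = log N(xt; x, sigma^2 I), up to a constant
    return -np.sum((xt - x) ** 2) / (2 * sigma ** 2)

# Finite-difference gradient of log q w.r.t. the perturbed point
eps = 1e-6
fd = np.array([(log_q(x_tilde + eps * e) - log_q(x_tilde - eps * e)) / (2 * eps)
               for e in np.eye(2)])
target = -(x_tilde - x) / sigma ** 2  # = -z / sigma
print(fd, target)  # the two vectors agree
```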
We can implement the denoising score matching loss as follows
def denoising_score_matching(scorenet, samples, sigma=0.01):
    # Perturb the samples with Gaussian noise
    perturbed_samples = samples + torch.randn_like(samples) * sigma
    # Score of the perturbation kernel: -(x_tilde - x) / sigma^2
    target = - 1 / (sigma ** 2) * (perturbed_samples - samples)
    scores = scorenet(perturbed_samples)
    target = target.view(target.shape[0], -1)
    scores = scores.view(scores.shape[0], -1)
    loss = 1 / 2. * ((scores - target) ** 2).sum(dim=-1).mean(dim=0)
    return loss
We rely on the same model and optimizer as before
# Our approximation model
model = nn.Sequential(
    nn.Linear(2, 128), nn.Softplus(),
    nn.Linear(128, 128), nn.Softplus(),
    nn.Linear(128, 2)
)
# Create ADAM optimizer over our model
optimizer = optim.Adam(model.parameters(), lr=1e-3)
dataset = torch.tensor(data.T).float()
for t in range(2000):
    # Compute the loss
    loss = denoising_score_matching(model, dataset)
    # Before the backward pass, zero all of the network gradients
    optimizer.zero_grad()
    # Backward pass: compute gradient of the loss with respect to parameters
    loss.backward()
    # Calling the step function to update the parameters
    optimizer.step()
    # Print loss
    if ((t % 1000) == 0):
        print(loss)
tensor(10007.4111, grad_fn=<MulBackward0>)
tensor(10055.4170, grad_fn=<MulBackward0>)
plot_gradients(model, data)
Noise conditional score networks¶
In a recent paper, Song and Ermon [8] built on these ideas to develop a score-based generative framework called Noise-Conditional Score Networks (NCSN). They underlined several flaws in the existing score matching objectives. First, they showed that the use of (sliced) score matching without noise is inconsistent when the data follows the manifold hypothesis, i.e., the noiseless objectives are only consistent when the distribution spans the whole space. Second, they showed that low-density regions can cause difficulties for both score matching and sampling with Langevin dynamics.
To address these issues, they propose to rely on perturbed data to bypass the manifold issue, but also to simultaneously estimate the scores corresponding to all noise levels by training a single conditional score network. To do so, we consider a positive geometric sequence of noise variances $\{\sigma_{i}\}_{i=1}^{L}$, choosing $\sigma_{1}$ to be large enough to mitigate the manifold issue, and satisfying $\frac{\sigma_{1}}{\sigma_{2}} = \cdots = \frac{\sigma_{L-1}}{\sigma_{L}} > 1$. The goal is to train a conditional network to estimate the gradients of all perturbed data distributions, i.e. $$ \forall \sigma \in \{\sigma_{i}\}_{i=1}^{L}, \quad \mathcal{F}_{\theta}(\tilde{\mathbf{x}}, \sigma) \approx \nabla_{\mathbf{x}} \log q_{\sigma}(\mathbf{x}) $$ The network learning this is called a Noise Conditional Score Network (NCSN). To perform this optimization, we start from the previously-defined denoising score matching objective $$ \ell(\theta;\sigma) = E_{q_{\sigma}(\tilde{\mathbf{x}}\mid\mathbf{x})} E_{\mathbf{x} \sim p(\mathbf{x})} \left[ \left\lVert \mathcal{F}_{\theta}(\tilde{\mathbf{x}}, \sigma) + \frac{\tilde{\mathbf{x}} - \mathbf{x}}{\sigma^{2}} \right\rVert_2^2 \right] $$ As this addresses a single noise level, these objectives can be combined into a single unified objective $$ \mathcal{L}(\theta;\{\sigma_{i}\}_{i=1}^{L}) = \frac{1}{L} \sum_{i=1}^{L} \lambda(\sigma_{i})\ell(\theta;\sigma_{i}) $$ where $\lambda(\sigma_{i}) > 0$ is a coefficient function depending on $\sigma_{i}$. This objective can be implemented as follows
def anneal_dsm_score_estimation(model, samples, labels, sigmas, anneal_power=2.):
    # Select the noise level for each sample in the batch
    used_sigmas = sigmas[labels].view(samples.shape[0], *([1] * len(samples.shape[1:])))
    perturbed_samples = samples + torch.randn_like(samples) * used_sigmas
    target = - 1 / (used_sigmas ** 2) * (perturbed_samples - samples)
    scores = model(perturbed_samples, labels)
    target = target.view(target.shape[0], -1)
    scores = scores.view(scores.shape[0], -1)
    # Weight each noise level by lambda(sigma) = sigma ** anneal_power
    loss = 1 / 2. * ((scores - target) ** 2).sum(dim=-1) * used_sigmas.squeeze() ** anneal_power
    return loss.mean(dim=0)
As can be seen in the previous equation, this model requires a conditional network $\mathcal{F}_{\theta}(\tilde{\mathbf{x}}, \sigma_{i})$ that also takes the noise level $\sigma_{i}$ as input. Therefore, we redefine the previous model to account for this
import torch.nn.functional as F
class ConditionalLinear(nn.Module):
    def __init__(self, num_in, num_out, num_classes):
        super().__init__()
        self.num_out = num_out
        self.lin = nn.Linear(num_in, num_out)
        self.embed = nn.Embedding(num_classes, num_out)
        self.embed.weight.data.uniform_()

    def forward(self, x, y):
        out = self.lin(x)
        # Scale the linear output by a learned, noise-level-dependent gain
        gamma = self.embed(y)
        out = gamma.view(-1, self.num_out) * out
        return out

class ConditionalModel(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        self.lin1 = ConditionalLinear(2, 128, num_classes)
        self.lin2 = ConditionalLinear(128, 128, num_classes)
        self.lin3 = nn.Linear(128, 2)

    def forward(self, x, y):
        x = F.softplus(self.lin1(x, y))
        x = F.softplus(self.lin2(x, y))
        return self.lin3(x)
Finally, we can perform the same optimization as before
sigma_begin = 1
sigma_end = 0.01
num_classes = 4
sigmas = torch.tensor(np.exp(np.linspace(np.log(sigma_begin), np.log(sigma_end), num_classes))).float()
# Our approximation model
model = ConditionalModel(num_classes)
dataset = torch.tensor(data.T).float()
# Create ADAM optimizer over our model
optimizer = optim.Adam(model.parameters(), lr=1e-3)
for t in range(5000):
    # Sample a random noise level for each example
    labels = torch.randint(0, len(sigmas), (dataset.shape[0],))
    loss = anneal_dsm_score_estimation(model, dataset, labels, sigmas)
    # Before the backward pass, zero all of the network gradients
    optimizer.zero_grad()
    # Backward pass: compute gradient of the loss with respect to parameters
    loss.backward()
    # Calling the step function to update the parameters
    optimizer.step()
    # Print loss
    if ((t % 1000) == 0):
        print(loss)
tensor(1.0236, grad_fn=<MeanBackward1>)
tensor(0.8097, grad_fn=<MeanBackward1>)
tensor(0.7809, grad_fn=<MeanBackward1>)
tensor(0.7596, grad_fn=<MeanBackward1>)
tensor(0.7753, grad_fn=<MeanBackward1>)
# Plot the learned scores, sampling a random noise level for each grid point
xx = np.stack(np.meshgrid(np.linspace(-1.5, 2.0, 50), np.linspace(-1.5, 2.0, 50)), axis=-1).reshape(-1, 2)
labels = torch.randint(0, len(sigmas), (xx.shape[0],))
scores = model(torch.tensor(xx).float(), labels).detach()
scores_norm = np.linalg.norm(scores, axis=-1, ord=2, keepdims=True)
scores_log1p = scores / (scores_norm + 1e-9) * np.log1p(scores_norm)
# Perform the plots
plt.figure(figsize=(16,12))
plt.scatter(*data, alpha=0.3, color='red', edgecolor='white', s=40)
plt.quiver(*xx.T, *scores_log1p.T, width=0.002, color='white')
plt.xlim(-1.5, 2.0)
plt.ylim(-1.5, 2.0);
# Plot the learned scores at a single fixed noise level
xx = np.stack(np.meshgrid(np.linspace(-1.5, 2.0, 50), np.linspace(-1.5, 2.0, 50)), axis=-1).reshape(-1, 2)
labels = torch.ones(xx.shape[0]).long()
scores = model(torch.tensor(xx).float(), labels).detach()
scores_norm = np.linalg.norm(scores, axis=-1, ord=2, keepdims=True)
scores_log1p = scores / (scores_norm + 1e-9) * np.log1p(scores_norm)
# Perform the plots
plt.figure(figsize=(16,12))
plt.scatter(*data, alpha=0.3, color='red', edgecolor='white', s=40)
plt.quiver(*xx.T, *scores_log1p.T, width=0.002, color='white')
plt.xlim(-1.5, 2.0)
plt.ylim(-1.5, 2.0)
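The trained NCSN is meant to be sampled with annealed Langevin dynamics (Algorithm 1 in Song and Ermon [8]), which runs a short Langevin chain at each noise level from $\sigma_{1}$ down to $\sigma_{L}$, with a step size $\epsilon_{i} \propto \sigma_{i}^{2}$. As a standalone sketch of that sampler (using an exact analytic score for a perturbed 1-D Gaussian in place of the trained conditional model, so the result can be checked):

```python
import math
import torch

torch.manual_seed(0)
mu, s = 1.0, 0.5  # stand-in target density N(mu, s^2)
# Exact score of the sigma-perturbed target: q_sigma = N(mu, s^2 + sigma^2)
score = lambda x, sigma: (mu - x) / (s ** 2 + sigma ** 2)
# Geometric noise schedule, as for the NCSN above
sigmas = torch.exp(torch.linspace(math.log(1.0), math.log(0.01), 10))
x = torch.randn(2000)  # 2000 independent chains
for sigma in sigmas:  # anneal from coarse to fine noise levels
    eps = 0.1 * sigma.item() ** 2  # step size proportional to sigma^2
    for t in range(200):
        x = x + (eps / 2) * score(x, sigma) + eps ** 0.5 * torch.randn_like(x)
print(x.mean().item(), x.std().item())  # close to (1.0, 0.5)
```

The coarse levels move samples quickly toward high-density regions, while the fine levels refine them; the same loop applies verbatim to the trained `model` by replacing the analytic score with `model(x, labels)`.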